Executive Summary

Between February 2021 and April 2022, a Penumbra-sponsored study collected medical data, ranging from social history to genetics, from 400 stroke patients. The goal of the study was to discover medically relevant details about patients that could further the understanding of stroke severity and treatment. This analysis explored over 10 GB of data and found evidence that helps explain the variability in stroke severity among male and female patients.

Project Overview

Background

Stroke is a leading cause of long-term disability in the United States. According to the CDC, more than 795,000 Americans suffer a stroke each year. A stroke is a medical emergency where either:

  • blood flow to the brain is blocked by a clot (Ischemic Stroke) or
  • bleeding in the brain occurs due to a ruptured artery (Hemorrhagic Stroke)

Both events can damage the brain and cause long-term disability or death. The severity of a stroke is often measured using the National Institutes of Health Stroke Scale (NIHSS). The final score (ranging from 0 to 42) is derived from 15 neurological examination items, and stroke severity can be roughly interpreted using the following bands:

  • Very Severe: >= 25
  • Severe: 15 - 24
  • Moderate: 5 - 14
  • Mild: 1 - 4
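
The bands above can be sketched as a small lookup function. This is a minimal illustration rather than part of the study's code: the function name `nihss_band` is hypothetical, and the label returned for a score of 0 is an assumption, since the bands listed above start at 1.

```python
def nihss_band(score: int) -> str:
    """Map a total NIHSS score (0-42) to the rough severity bands listed above."""
    if not 0 <= score <= 42:
        raise ValueError(f"NIHSS total must be between 0 and 42, got {score}")
    if score >= 25:
        return "Very Severe"
    if score >= 15:
        return "Severe"
    if score >= 5:
        return "Moderate"
    if score >= 1:
        return "Mild"
    # The listed bands do not cover a score of 0; this label is an assumption.
    return "No measurable deficit"


# Print the band at each boundary value.
for s in (0, 1, 4, 5, 14, 15, 24, 25, 42):
    print(s, nihss_band(s))
```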

Between February of 2021 and April of 2022 various medical details were collected from 400 stroke patients as part of a study sponsored by Penumbra, a company that is developing products to help treat stroke patients. The data is broken into the following three categories:

  • Hospital Data: The study attempted to collect each patient’s medical and social history as well as specific details related to their stroke and any treatment they received. The data is gathered into a table that has 400 rows (1 for each patient) and 248 columns which capture the details gathered for each patient.
  • Proteomics: The study collected blood and clot samples from 80 of the patients and Creative Proteomics ran the protein samples on SDS-PAGE gel followed by in-gel digestion and then identified and quantified 1600+ proteins by applying their nanoLC-MS/MS platform.
  • Genetics: Genotypes for each of the 400 patients were extracted in the form of Single Nucleotide Polymorphisms (SNPs). Each SNP is a single base location in the DNA where there is known to be substantial population variability. The SNP data collected in the study has over 654K SNPs totaling 9.97 GB of data.

Objective

The goal of this analysis was to identify any information contained in the study that could advance our understanding of strokes with the hope of improving patient treatment. Accordingly, this analysis focused on creating explanatory models as opposed to a predictive model.

Analysis

Hospital Data

Exploratory Data Analysis (EDA)

Since the objective of this analysis was to construct an explanatory model, EDA played a key role in assessing data quality, choosing a response variable, selecting candidate predictors, and identifying any data handling techniques needed.

Choosing a Response

The hospital data had several variables that could be used to assess the severity of a patient’s stroke, including the scores assigned using the Glasgow Coma Scale (GCSSCTOT), the National Institutes of Health Stroke Scale (NIHSSTOT), and the Modified Rankin Scale (MRSSCORE). Changes in these scores between admission and discharge could also be considered when searching for treatment effects. The choice of response came down to selecting the variable with the fewest missing values. At admission, GCSSCTOT was missing for 335 of the 400 patients, MRSSCORE was missing for 86, and NIHSSTOT was missing for only 4 (the univariate statistics below report 4 NAs). Missing values were prevalent in the discharge details as well. As a result, the patient’s total NIHSS score at admission (NIHSSTOT) was selected as the response variable for this analysis.
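
The selection rule described above (pick the candidate response with the fewest missing values) can be sketched as follows. The rows are made-up toy records, not study data, and the helper names `count_missing` and `fewest_missing` are hypothetical; only the three column names come from the report.

```python
from typing import Optional

Row = dict[str, Optional[float]]


def count_missing(rows: list[Row], columns: list[str]) -> dict[str, int]:
    """Count missing (None) entries per column."""
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}


def fewest_missing(rows: list[Row], columns: list[str]) -> str:
    """Return the column with the fewest missing values (first one wins a tie)."""
    counts = count_missing(rows, columns)
    return min(columns, key=lambda c: counts[c])


# Toy records for illustration only (not values from the study).
rows = [
    {"GCSSCTOT": None, "MRSSCORE": 3, "NIHSSTOT": 12},
    {"GCSSCTOT": None, "MRSSCORE": None, "NIHSSTOT": 20},
    {"GCSSCTOT": 14, "MRSSCORE": 4, "NIHSSTOT": None},
    {"GCSSCTOT": None, "MRSSCORE": None, "NIHSSTOT": 7},
]
candidates = ["GCSSCTOT", "MRSSCORE", "NIHSSTOT"]

print(count_missing(rows, candidates))
print(fewest_missing(rows, candidates))
```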

Univariate statistics for NIHSSTOT are shown below. The first 2 plots show the distribution of the NIHSSTOT values and the bottom plot shows its cumulative distribution. The patients’ stroke severity scores are slightly positively skewed (skew 0.24) and cover nearly the full range of the scale (0 to 40 out of a possible 42), with roughly half of the patients having Mild to Moderate scores and half having Severe to Very Severe scores (median 15).

Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$NIHSSTOT (numeric)
## 
##   length      n    NAs  unique     0s   mean  meanCI'
##      400    396      4      37      2  15.13   14.37
##           99.0%   1.0%           0.5%          15.89
##                                                     
##      .05    .10    .25  median    .75    .90     .95
##     3.00   5.00   9.00   15.00  20.00  25.00   28.00
##                                                     
##    range     sd  vcoef     mad    IQR   skew    kurt
##    40.00   7.68   0.51    8.90  11.00   0.24   -0.40
##                                                     
## lowest : 0.0 (2), 1.0 (11), 2.0 (3), 3.0 (6), 4.0 (10)
## highest: 32.0, 33.0 (3), 34.0, 37.0, 40.0
## 
## ' 95%-CI (classic)
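
As a rough check on the summary above, the confidence interval for the mean and the coefficient of variation can be recomputed from the reported mean, standard deviation, and n. This sketch assumes the "classic" interval is the usual mean ± critical value × sd / sqrt(n). It substitutes a normal critical value for Student's t, which is an approximation that agrees to two decimals at this sample size. The function name `mean_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(mean: float, sd: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width


# Values reported for NIHSSTOT above: mean 15.13, sd 7.68, n 396.
lower, upper = mean_ci(15.13, 7.68, 396)
print(round(lower, 2), round(upper, 2))  # 14.37 15.89

# Coefficient of variation (vcoef) is sd divided by the mean.
print(round(7.68 / 15.13, 2))  # 0.51
```

Both results match the meanCI and vcoef entries in the table.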

Selecting Candidate Predictors

Since the goal of this analysis was to create an explanatory model, candidate predictors were selected by manually reviewing the 240+ available variables against three criteria: whether enough data was present to merit an analysis, whether there were potential differences in centrality and spread, and whether the predictor could reasonably be included in the model based on previous stroke analyses or demographic areas of interest. In the end, the following 23 predictors were selected as candidates for an explanatory model.

  1. GLUC: Glucose Level (mg/dL)
  2. WBC: White Blood Cell Count (k/uL)
  3. RBC: Red Blood Cell Count (k/uL)
  4. HCT: Hematocrit (Percentage of red blood cells by volume)
  5. HBG: Hemoglobin (g/dL)
  6. CLCLTAR: Clot Area
  7. CLCLTWT: Clot Weight
  8. AGE: Age in years (integer)
  9. SEX: Sex male (1) or female (2)
  10. HEIGHT: Height (cm)
  11. WEIGHT: Weight (kg)
  12. BMI: Body Mass Index (kg/m^2)
  13. MHNONE: No pertinent medical history (Y = Yes)
  14. MHPSIS: Suffered previous ischemic stroke (Y = Yes)
  15. MHPSTIA: Suffered previous transient ischemic attack (Y = Yes)
  16. MHDVT: Suffered from deep vein thrombosis (Y = Yes)
  17. MHDM: Has Diabetes (Y = Yes)
  18. MHHTN: Has Hypertension (Y = Yes)
  19. MHTHROMB: Has medical history of Thrombocytopenia
  20. MHATEXCR: Has medical history of Extracranial - Carotid Atherosclerosis
  21. MHPSISEL: Previous type of ischemic stroke (when applicable)
  22. SHALCUSE: History of Alcohol Use
  23. SHMRJYN: History of Marijuana use

The summary statistics for each candidate predictor are provided below under its Univariate Statistics heading. The relationship between each candidate predictor and the response variable (NIHSSTOT) is summarized under its Bivariate Statistics heading.

Numeric Candidates

GLUC

The glucose levels of all of the patients in this study were measured in mg/dL. Notably, the data is positively skewed and has a slight positive correlation with stroke severity.

## [1] "Category: Baseline Laboratory Values"
## [1] "Description: Glucose"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$GLUC (numeric)
## 
##   length      n     NAs  unique      0s    mean  meanCI'
##      400    394       6     139       0  140.61  134.70
##           98.5%    1.5%            0.0%          146.53
##                                                        
##      .05    .10     .25  median     .75     .90     .95
##    89.00  96.00  107.00  122.50  149.75  201.70  260.45
##                                                        
##    range     sd   vcoef     mad     IQR    skew    kurt
##   420.00  59.74    0.42   28.91   42.75    2.69    9.02
##                                                        
## lowest : 69.0, 73.0, 74.0, 80.0, 81.0 (2)
## highest: 408.0, 422.0, 438.0, 450.0, 489.0
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ GLUC (patients)
## 
## Summary: 
## n pairs: 400, valid: 390 (97.5%), missings: 10 (2.5%)
## 
## 
## Pearson corr. : 0.125
## Spearman corr.: 0.133
## Kendall corr. : 0.092
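
The three coefficients reported in each bivariate summary can be sketched as follows. The example first keeps only the pairs where both values are present, mirroring the "valid" pair count above, and then computes Pearson, Spearman (Pearson on average ranks), and Kendall correlations. The tau-b variant of Kendall's coefficient is used here as an assumption about what the report shows. The data is a made-up toy sample, and all function names are hypothetical.

```python
from math import sqrt


def complete_pairs(xs, ys):
    """Keep only the pairs where both values are present."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]


def pearson(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(pairs):
    rx = average_ranks([x for x, _ in pairs])
    ry = average_ranks([y for _, y in pairs])
    return pearson(list(zip(rx, ry)))


def kendall_tau_b(pairs):
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            dx = pairs[i][0] - pairs[j][0]
            dy = pairs[i][1] - pairs[j][1]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: counts toward neither term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + tied_x_only) * (untied + tied_y_only))


# Toy values for illustration only (not study data); None marks a missing value.
gluc = [90, 110, None, 150, 200, 120]
nihss = [4, 10, 12, None, 22, 9]

pairs = complete_pairs(gluc, nihss)
print(len(pairs), "valid pairs")
print(round(pearson(pairs), 3), round(spearman(pairs), 3), round(kendall_tau_b(pairs), 3))
```
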
WBC

The white blood cell count was recorded in k/uL or 10^3 cells/mm^3 for nearly every patient; the two units are equivalent. The standard range for white blood cell count is 4 - 11 k/uL. The data is positively skewed and has a slight positive correlation with stroke severity.

## [1] "Category: Baseline Laboratory Values"
## [1] "Description: White Blood Cells"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$WBC (numeric)
## 
##    length       n     NAs  unique       0s     mean   meanCI'
##       400     394       6     207        0   9.4129   8.9587
##             98.5%    1.5%             0.0%            9.8670
##                                                             
##       .05     .10     .25  median      .75      .90      .95
##    4.5755  5.0000  6.4300  8.8000  11.0000  14.3000  16.3050
##                                                             
##     range      sd   vcoef     mad      IQR     skew     kurt
##   59.0600  4.5855  0.4871  3.4100   4.5700   4.2522  40.3834
##                                                             
## lowest : 2.04, 3.2, 3.26, 3.4, 3.8
## highest: 22.4, 23.74, 23.78, 28.1, 61.1
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ WBC (patients)
## 
## Summary: 
## n pairs: 400, valid: 390 (97.5%), missings: 10 (2.5%)
## 
## 
## Pearson corr. : 0.112
## Spearman corr.: 0.068
## Kendall corr. : 0.045
RBC

The red blood cell count was recorded for nearly every patient, with the units given (equivalently) as k/uL or 10^3 cells/mm^3. The observed values (median 4.375) are more consistent with the conventional reporting unit of 10^6 cells/uL, so the stated units may be mislabeled. The data does not have any apparent outliers and is fairly symmetric in distribution. The red blood cell count, on its own, isn’t correlated with stroke severity but was included since it helps describe the patient’s blood composition.

## [1] "Category: Baseline Laboratory Values"
## [1] "Description: Red Blood Cells"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$RBC (numeric)
## 
##   length       n     NAs  unique      0s    mean  meanCI'
##      400     384      16     201       0  4.3665  4.2914
##            96.0%    4.0%            0.0%          4.4417
##                                                         
##      .05     .10     .25  median     .75     .90     .95
##   3.0500  3.5100  3.8900  4.3750  4.8200  5.2400  5.5585
##                                                         
##    range      sd   vcoef     mad     IQR    skew    kurt
##   5.8000  0.7488  0.1715  0.7042  0.9300  0.2679  1.5501
##                                                         
## lowest : 2.37, 2.41, 2.44, 2.6, 2.67 (2)
## highest: 6.07, 6.17, 6.3, 6.94, 8.17
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ RBC (patients)
## 
## Summary: 
## n pairs: 400, valid: 380 (95.0%), missings: 20 (5.0%)
## 
## 
## Pearson corr. : 0.039
## Spearman corr.: 0.022
## Kendall corr. : 0.012
HCT

Hematocrit is the percentage of red blood cells by volume. The data is fairly symmetric in distribution and doesn’t appear to have any notable outliers. Like the red blood cell count, Hematocrit on its own isn’t correlated with the stroke severity but was included since it helps describe the patient’s blood composition.

## [1] "Category: Baseline Laboratory Values"
## [1] "Description: Hematocrit"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$HCT (numeric)
## 
##   length      n    NAs  unique     0s   mean  meanCI'
##      400    392      8     187      0  39.22   38.62
##           98.0%   2.0%           0.0%          39.81
##                                                     
##      .05    .10    .25  median    .75    .90     .95
##    28.60  31.00  35.80   39.45  43.40  46.59   48.20
##                                                     
##    range     sd  vcoef     mad    IQR   skew    kurt
##    37.70   5.99   0.15    5.86   7.60  -0.36    0.05
##                                                     
## lowest : 19.0, 21.1, 23.0, 23.9, 24.0
## highest: 51.1, 51.3, 52.4, 52.7, 56.7
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ HCT (patients)
## 
## Summary: 
## n pairs: 400, valid: 388 (97.0%), missings: 12 (3.0%)
## 
## 
## Pearson corr. : 0.034
## Spearman corr.: 0.026
## Kendall corr. : 0.017
HBG

Each patient’s hemoglobin level was measured in g/dL. The data is fairly symmetric in distribution and doesn’t appear to have any notable outliers. Hemoglobin on its own isn’t correlated with stroke severity but was included since it helps describe the patient’s blood composition.

## [1] "Category: Baseline Laboratory Values"
## [1] "Description: Hemoglobin"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$HBG (numeric)
## 
##   length      n    NAs  unique     0s   mean  meanCI'
##      400    392      8      98      0  12.88   12.65
##           98.0%   2.0%           0.0%          13.10
##                                                     
##      .05    .10    .25  median    .75    .90     .95
##     8.80   9.80  11.60   13.10  14.40  15.50   16.20
##                                                     
##    range     sd  vcoef     mad    IQR   skew    kurt
##    15.28   2.24   0.17    1.93   2.80  -0.56    0.57
##                                                     
## lowest : 3.72, 5.7, 6.4, 7.0, 7.1
## highest: 17.0, 17.1 (2), 17.2, 17.6 (2), 19.0
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ HBG (patients)
## 
## Summary: 
## n pairs: 400, valid: 388 (97.0%), missings: 12 (3.0%)
## 
## 
## Pearson corr. : 0.011
## Spearman corr.: 0.007
## Kendall corr. : 0.003
CLCLTAR

CLCLTAR is the clot area. The units were not provided but were presumably entered in square millimeters. The data is positively skewed and appears to have some outliers. Additionally, 42 patients did not have a value for the clot area. The clot area was included since it has a slight positive correlation with stroke severity.

## [1] "Category: Histopathology Results of Thrombus Retrieval"
## [1] "Description: Clot area"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$CLCLTAR (numeric)
## 
##     length       n    NAs  unique      0s    mean  meanCI'
##        400     358     42     121       0  186.17  144.05
##              89.5%  10.5%            0.0%          228.28
##                                                          
##        .05     .10    .25  median     .75     .90     .95
##       8.55   20.00  40.00   84.00  150.00  350.30  640.00
##                                                          
##      range      sd  vcoef     mad     IQR    skew    kurt
##   4'499.00  405.22   2.18   76.35  110.00    6.32   51.07
##                                                          
## lowest : 1.0, 2.0 (4), 2.5, 3.0 (2), 4.0 (4)
## highest: 1'800.0 (2), 1'848.0, 2'250.0, 3'500.0, 4'500.0
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ CLCLTAR (patients)
## 
## Summary: 
## n pairs: 400, valid: 354 (88.5%), missings: 46 (11.5%)
## 
## 
## Pearson corr. : 0.121
## Spearman corr.: 0.199
## Kendall corr. : 0.133
CLCLTWT

CLCLTWT is the clot weight. The units were not provided. The data is positively skewed and appears to have some outliers. As with CLCLTAR, the clot weight was not available for 42 of the patients. The clot weight was included since it has a slight positive correlation with stroke severity.

## [1] "Category: Histopathology Results of Thrombus Retrieval"
## [1] "Description: Clot weight"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$CLCLTWT (numeric)
## 
##     length       n    NAs  unique     0s    mean  meanCI'
##        400     358     42     158      0  104.99   82.72
##              89.5%  10.5%           0.0%          127.27
##                                                         
##        .05     .10    .25  median    .75     .90     .95
##       5.00    9.00  20.25   46.00  88.75  201.00  366.50
##                                                         
##      range      sd  vcoef     mad    IQR    skew    kurt
##   1'562.50  214.32   2.04   43.00  68.50    4.62   23.56
##                                                         
## lowest : 0.5, 1.0 (3), 2.0 (3), 3.0 (2), 4.0 (8)
## highest: 1'350.0, 1'355.0, 1'430.0, 1'485.0, 1'563.0
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ CLCLTWT (patients)
## 
## Summary: 
## n pairs: 400, valid: 354 (88.5%), missings: 46 (11.5%)
## 
## 
## Pearson corr. : 0.099
## Spearman corr.: 0.181
## Kendall corr. : 0.122
AGE

The Age variable, measured in years, is symmetrically distributed without any outliers. Age, on its own, doesn’t appear to have a notable correlation with stroke severity but was included since it helps describe the patient’s demographics.

## [1] "Category: Demographics"
## [1] "Description: Age (years)"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$AGE (numeric)
## 
##   length       n    NAs  unique     0s   mean  meanCI'
##      400     400      0      67      0  68.92   67.46
##           100.0%   0.0%           0.0%          70.37
##                                                      
##      .05     .10    .25  median    .75    .90     .95
##    43.00   48.90  58.00   70.00  80.00  87.00   90.00
##                                                      
##    range      sd  vcoef     mad    IQR   skew    kurt
##    71.00   14.80   0.21   16.31  22.00  -0.40   -0.40
##                                                      
## lowest : 27.0, 28.0 (2), 29.0, 30.0, 32.0
## highest: 94.0 (2), 95.0, 96.0, 97.0 (2), 98.0
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ AGE (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%)
## 
## 
## Pearson corr. : 0.096
## Spearman corr.: 0.099
## Kendall corr. : 0.069
HEIGHT

The patient’s height was measured in centimeters and the values are fairly symmetric in distribution. Two of the patients had recorded heights below 95 centimeters (70.0 cm and 93.98 cm), which seemed implausible and distorts the BMI calculation. As a result, the data associated with these patients was dropped from the analysis. The patient’s height does not have a notable correlation with stroke severity but was included since it helps describe the patient’s physical attributes.

## [1] "Category: Demographics"
## [1] "Description: Height"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$HEIGHT (numeric)
## 
##   length       n     NAs  unique      0s    mean  meanCI'
##      400     382      18      69       0  169.72  168.51
##            95.5%    4.5%            0.0%          170.93
##                                                         
##      .05     .10     .25  median     .75     .90     .95
##   154.90  157.50  162.60  169.00  177.95  182.90  187.00
##                                                         
##    range      sd   vcoef     mad     IQR    skew    kurt
##   130.00   12.00    0.07   11.86   15.35   -1.94   14.95
##                                                         
## lowest : 70.0, 93.98, 142.2, 147.3, 149.9 (2)
## highest: 193.0 (2), 195.0, 195.6, 198.1, 200.0
## 
## heap(?): remarkable frequency (7.1%) for the mode(s) (= 160)
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ HEIGHT (patients)
## 
## Summary: 
## n pairs: 400, valid: 378 (94.5%), missings: 22 (5.5%)
## 
## 
## Pearson corr. : -0.063
## Spearman corr.: -0.037
## Kendall corr. : -0.026
WEIGHT

Each patient’s weight was measured in kilograms and the values are slightly positively skewed. The patient’s weight does not have a notable correlation with the stroke severity but was included since it helps describe the patient’s physical attributes.

## [1] "Category: Demographics"
## [1] "Description: Weight"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$WEIGHT (numeric)
## 
##   length      n    NAs  unique     0s    mean  meanCI'
##      400    399      1     274      0   85.08   82.85
##           99.8%   0.2%           0.0%           87.31
##                                                      
##      .05    .10    .25  median    .75     .90     .95
##    55.18  59.00  69.00   81.90  97.75  116.12  130.00
##                                                      
##    range     sd  vcoef     mad    IQR    skew    kurt
##   128.80  22.67   0.27   20.61  28.75    0.90    0.90
##                                                      
## lowest : 41.2, 42.2, 46.7, 47.7, 49.0
## highest: 150.0, 152.9, 154.0, 162.0, 170.0 (2)
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ WEIGHT (patients)
## 
## Summary: 
## n pairs: 400, valid: 395 (98.8%), missings: 5 (1.2%)
## 
## 
## Pearson corr. : -0.038
## Spearman corr.: -0.039
## Kendall corr. : -0.027
BMI

BMI is the patient’s Body Mass Index in kg/m^2. The values are positively skewed and contain 2 outliers as a result of the patients with heights less than 95 cm. BMI does not have a notable correlation with stroke severity but was included since it helps describe the patient’s physical attributes.

## [1] "Category: Demographics"
## [1] "Description: Body Mass Index"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$BMI (numeric)
## 
##     length        n      NAs   unique       0s     mean    meanCI'
##        400      382       18      340        0  29.8525   28.7551
##               95.5%     4.5%              0.0%            30.9499
##                                                                  
##        .05      .10      .25   median      .75      .90       .95
##    19.5380  21.5130  24.3000  28.0450  33.1175  39.2330   44.0970
##                                                                  
##      range       sd    vcoef      mad      IQR     skew      kurt
##   167.5800  10.9084   0.3654   6.2417   8.8175   7.7457  101.9419
##                                                                  
## lowest : 16.07, 16.61, 16.83, 16.9, 17.5
## highest: 55.09, 60.95, 61.26, 73.07, 183.65
## 
## ' 95%-CI (classic)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ BMI (patients)
## 
## Summary: 
## n pairs: 400, valid: 378 (94.5%), missings: 22 (5.5%)
## 
## 
## Pearson corr. : 0.041
## Spearman corr.: -0.035
## Kendall corr. : -0.026
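
The effect of an implausible height on BMI can be illustrated with a short sketch. The BMI formula is weight in kilograms divided by the square of height in meters, and the 95 cm threshold is the one described in the HEIGHT section. The 90 kg weight and the three example records are hypothetical values chosen for illustration, not study data, and the function names are assumptions.

```python
MIN_PLAUSIBLE_HEIGHT_CM = 95  # threshold described in the HEIGHT section


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index in kg/m^2."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def has_plausible_height(record: dict) -> bool:
    """Keep records whose height is missing or at least the plausibility threshold."""
    height = record.get("HEIGHT")
    return height is None or height >= MIN_PLAUSIBLE_HEIGHT_CM


# A hypothetical 90 kg patient: a typical height versus a mis-entered one.
print(round(bmi(90, 170), 2))  # 31.14
print(round(bmi(90, 70), 2))   # 183.67

# Hypothetical records; the one with the implausible height is dropped.
records = [
    {"HEIGHT": 170.0, "WEIGHT": 90.0},
    {"HEIGHT": 70.0, "WEIGHT": 90.0},
    {"HEIGHT": None, "WEIGHT": 75.0},
]
kept = [r for r in records if has_plausible_height(r)]
print(len(kept), "of", len(records), "records kept")
```

Because height enters the formula squared, a height recorded at well under half its plausible value inflates BMI several times over, which is why such records dominate the skew and kurtosis of the BMI summary.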

Categorical Candidates

SEX

The Sex variable indicates if a patient is male (1) or female (2). While there isn’t a noticeable difference in the mean stroke severity (14.85 vs. 15.40), the box plots show a difference in spread (an IQR of 9 for males vs. 12 for females), and sex was selected as a candidate variable as a result.

## [1] "Category: Demographics"
## [1] "Description: Sex"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$SEX (character - dichotomous)
## 
##   length      n    NAs unique
##      400    400      0      2
##          100.0%   0.0%       
## 
##    freq   perc  lci.95  uci.95'
## 2   206  51.5%   46.6%   56.4%
## 1   194  48.5%   43.6%   53.4%
## 
## ' 95%-CI (Wilson)
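
The Wilson interval named in the footnote can be reproduced from the reported counts. This is a minimal sketch of the standard Wilson score interval for a binomial proportion; the function name `wilson_ci` is hypothetical, and the counts (206 of 400) are taken from the SEX table above.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# SEX = 2 (female): 206 of 400 patients.
low, high = wilson_ci(206, 400)
print(f"{low:.1%} to {high:.1%}")  # 46.6% to 56.4%
```
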
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ SEX (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
## 
##                         
##               1        2
## mean     14.848   15.395
## median   15.000   15.000
## sd        7.227    8.092
## IQR       9.000   12.000
## n           191      205
## np      48.232%  51.768%
## NAs           3        1
## 0s            1        1
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 0.25255, df = 1, p-value = 0.6153
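
The Kruskal-Wallis test reported for each categorical candidate can be sketched for the two-group case, which covers the dichotomous predictors shown in this section. The statistic is computed from average ranks with the usual tie correction, and the p-value uses the chi-squared distribution with 1 degree of freedom, whose upper tail equals erfc(sqrt(H / 2)). The function names and the toy groups are hypothetical; this is an illustration under those assumptions, not the code used in the study.

```python
from collections import Counter
from math import erfc, sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def chi2_sf_1df(h: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(max(h, 0.0) / 2))


def kruskal_wallis_two_groups(a, b):
    """Return (H, p-value) for two independent samples."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[: len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)) - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over tied groups.
    correction = 1 - sum(t ** 3 - t for t in Counter(pooled).values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; the statistic is undefined")
    h /= correction
    return h, chi2_sf_1df(h)


# Toy groups for illustration only (not study data).
h, p = kruskal_wallis_two_groups([1, 2, 3], [4, 5, 6])
print(round(h, 3), round(p, 4))

# The reported statistic for SEX gives back the reported p-value.
print(round(chi2_sf_1df(0.25255), 4))  # 0.6153
```
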
MHNONE

The MHNONE variable is ‘Y’ if there is no pertinent medical history for the patient. Patients who indicated that they had no pertinent medical history appear (from the boxplots) to have less severe strokes, and the factor was selected as a candidate as a result.

## [1] "Category: Medical History"
## [1] "Description: No Pertinent Medical History"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$MHNONE (character - dichotomous)
## 
##   length      n    NAs unique
##      400    400      0      2
##          100.0%   0.0%       
## 
##      freq   perc  lci.95  uci.95'
## UNK   371  92.8%   89.8%   94.9%
## Y      29   7.2%    5.1%   10.2%
## 
## ' 95%-CI (Wilson)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ MHNONE (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
## 
##                         
##             UNK        Y
## mean     15.305   12.931
## median   15.000   13.000
## sd        7.701    7.201
## IQR      11.500   10.000
## n           367       29
## np      92.677%   7.323%
## NAs           4        0
## 0s            2        0
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 2.6381, df = 1, p-value = 0.1043
MHPSIS

The MHPSIS variable indicates if a patient suffered from a previous ischemic stroke. Patients who suffered from prior ischemic strokes appear to experience more severe strokes, making this a good candidate predictor.

## [1] "Category: Medical History"
## [1] "Description: Previous Ischemic Stroke"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$MHPSIS (character - dichotomous)
## 
##   length      n    NAs unique
##      400    400      0      2
##          100.0%   0.0%       
## 
##      freq   perc  lci.95  uci.95'
## UNK   344  86.0%   82.3%   89.1%
## Y      56  14.0%   10.9%   17.7%
## 
## ' 95%-CI (Wilson)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ MHPSIS (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
## 
##                         
##             UNK        Y
## mean     14.713   17.778
## median   14.500   19.000
## sd        7.680    7.218
## IQR      11.000   11.500
## n           342       54
## np      86.364%  13.636%
## NAs           2        2
## 0s            2        0
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 8.5409, df = 1, p-value = 0.003473
MHPSTIA

The MHPSTIA variable indicates if a patient had a previous transient ischemic attack. Only 10 patients indicated that they had a previous transient ischemic attack and their median stroke severity appears (based on the boxplots) to be slightly less than those who didn’t experience a previous transient ischemic attack. This was selected as a candidate predictor due to its relation to ischemic strokes.

## [1] "Category: Medical History"
## [1] "Description: Previous Transient Ischemic Attack"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$MHPSTIA (character - dichotomous)
## 
##   length      n    NAs unique
##      400    400      0      2
##          100.0%   0.0%       
## 
##      freq   perc  lci.95  uci.95'
## UNK   390  97.5%   95.5%   98.6%
## Y      10   2.5%    1.4%    4.5%
## 
## ' 95%-CI (Wilson)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ MHPSTIA (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
## 
##                         
##             UNK        Y
## mean     15.153   14.300
## median   15.000   11.500
## sd        7.698    7.379
## IQR      11.000   12.250
## n           386       10
## np      97.475%   2.525%
## NAs           4        0
## 0s            2        0
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 0.12551, df = 1, p-value = 0.7231
MHDVT

MHDVT indicates if a patient has suffered from deep vein thrombosis. An interesting feature of the boxplots is that the data for patients who suffered from deep vein thrombosis appears to be positively skewed. This may suggest that these patients are less likely to experience mild to moderate strokes.

## [1] "Category: Medical History"
## [1] "Description: Deep Vein Thrombosis"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$MHDVT (character - dichotomous)
## 
##   length      n    NAs unique
##      400    400      0      2
##          100.0%   0.0%       
## 
##      freq   perc  lci.95  uci.95'
## UNK   382  95.5%   93.0%   97.1%
## Y      18   4.5%    2.9%    7.0%
## 
## ' 95%-CI (Wilson)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ MHDVT (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
## 
##                         
##             UNK        Y
## mean     15.103   15.722
## median   15.000   15.000
## sd        7.695    7.607
## IQR      11.000    6.750
## n           378       18
## np      95.455%   4.545%
## NAs           4        0
## 0s            1        1
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 0.16318, df = 1, p-value = 0.6862
MHDM

The MHDM variable indicates if a patient has diabetes. Patients with diabetes appear to suffer from more severe strokes on average. This aligns with the GLUC numeric predictor.

## [1] "Category: Medical History"
## [1] "Description: Diabetes Mellitus"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$MHDM (character - dichotomous)
## 
##   length      n    NAs unique
##      400    400      0      2
##          100.0%   0.0%       
## 
##      freq   perc  lci.95  uci.95'
## UNK   296  74.0%   69.5%   78.1%
## Y     104  26.0%   21.9%   30.5%
## 
## ' 95%-CI (Wilson)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ MHDM (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
## 
##                         
##             UNK        Y
## mean     14.646   16.529
## median   15.000   16.500
## sd        7.894    6.883
## IQR      11.000   10.750
## n           294      102
## np      74.242%  25.758%
## NAs           2        2
## 0s            1        1
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 5.5448, df = 1, p-value = 0.01854
MHHTN

The MHHTN variable indicates if a patient has Hypertension (High blood-pressure). Patients with Hypertension appear to have more severe strokes on average.

## [1] "Category: Medical History"
## [1] "Description: Hypertension"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$MHHTN (character - dichotomous)
## 
##   length      n    NAs unique
##      400    400      0      2
##          100.0%   0.0%       
## 
##      freq   perc  lci.95  uci.95'
## Y     290  72.5%   67.9%   76.6%
## UNK   110  27.5%   23.4%   32.1%
## 
## ' 95%-CI (Wilson)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ MHHTN (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
## 
##                         
##             UNK        Y
## mean     13.636   15.706
## median   14.000   15.000
## sd        7.566    7.662
## IQR      10.000   11.000
## n           110      286
## np      27.778%  72.222%
## NAs           0        4
## 0s            0        2
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 5.4604, df = 1, p-value = 0.01945
MHTHROMB

The MHTHROMB variable captures if a patient has a medical history of Thrombocytopenia, which occurs when platelet counts are low. There seems to be a difference in severity between patients who have a history of Thrombocytopenia and those who don’t, but 93% of the values are unknown and the difference is not statistically significant (p = 0.63).

## [1] "Category: Medical History"
## [1] "Description: Thrombocytopenia"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$MHTHROMB (character)
## 
##   length      n    NAs unique levels  dupes
##      400    400      0      3      3      y
##          100.0%   0.0%                     
## 
##    level  freq   perc  cumfreq  cumperc
## 1    UNK   372  93.0%      372    93.0%
## 2      N    19   4.8%      391    97.8%
## 3      Y     9   2.2%      400   100.0%
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ MHTHROMB (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 3
## 
##                                  
##               N      UNK        Y
## mean     16.316   15.076   14.889
## median   15.000   15.000   12.000
## sd        6.750    7.724    8.403
## IQR       8.000   11.000    6.000
## n            19      368        9
## np       4.798%  92.929%   2.273%
## NAs           0        4        0
## 0s            1        1        0
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 0.92621, df = 2, p-value = 0.6293
MHATEXCR

The MHATEXCR variable captures if a patient has a medical history of Extracranial - Carotid Atherosclerosis, which is a hardening and narrowing of vessels due to fat deposits. There seems to be a difference in severity between patients who have a history of Extracranial - Carotid Atherosclerosis and those who don’t, but about 92% of the values are unknown and the difference is not statistically significant (p = 0.66).

## [1] "Category: Medical History"
## [1] "Description: Extracranial - Carotid Atherosclerosis"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$MHATEXCR (character)
## 
##   length      n    NAs unique levels  dupes
##      400    400      0      3      3      y
##          100.0%   0.0%                     
## 
##    level  freq   perc  cumfreq  cumperc
## 1    UNK   367  91.8%      367    91.8%
## 2      N    20   5.0%      387    96.8%
## 3      Y    13   3.2%      400   100.0%
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ MHATEXCR (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 3
## 
##                                  
##               N      UNK        Y
## mean     16.579   15.027   15.923
## median   16.000   15.000   18.000
## sd        7.827    7.742    5.766
## IQR      11.000   11.000    6.000
## n            19      364       13
## np       4.798%  91.919%   3.283%
## NAs           1        3        0
## 0s            0        2        0
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 0.8365, df = 2, p-value = 0.6582
MHPSISEL

The MHPSISEL variable captures the previous type of ischemic stroke (when applicable). While there are quite a few missing values, it does appear that patients who had a cardio or cryptogenic Ischemic Stroke previously experience more severe strokes when compared against patients that didn’t have a previous ischemic stroke or had a LAA or SAO stroke.

## [1] "Category: Medical History"
## [1] "Description: Type of Previous Ischemic Stroke"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$MHPSISEL (character)
## 
##   length      n    NAs unique levels  dupes
##      400    400      0      5      5      y
##          100.0%   0.0%                     
## 
##        level  freq   perc  cumfreq  cumperc
## 1        UNK   345  86.2%      345    86.2%
## 2  CRYPTOGEN    29   7.2%      374    93.5%
## 3     CARDIO    15   3.8%      389    97.2%
## 4        SAO     8   2.0%      397    99.2%
## 5        LAA     3   0.8%      400   100.0%
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ MHPSISEL (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 5
## 
##                                                              
##            CARDIO  CRYPTOGEN        LAA        SAO        UNK
## mean       18.000     18.179     13.667     16.000     14.752
## median     20.000     18.500     13.000     15.000     15.000
## sd          6.370      7.414      6.028      8.699      7.702
## IQR        10.500      9.750      6.000     11.500     11.000
## n              15         28          3          7        343
## np         3.788%     7.071%     0.758%     1.768%    86.616%
## NAs             0          1          0          1          2
## 0s              0          0          0          0          2
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 9.0507, df = 4, p-value = 0.05984
SHMRJYN

The SHMRJYN variable indicates if a patient uses marijuana. The data for users appears to be positively skewed which may be an indication that users are less likely to experience mild to moderate strokes.

## [1] "Category: Medical History"
## [1] "Description: Marijuana Use"
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$SHMRJYN (character - dichotomous)
## 
##   length      n    NAs unique
##      400    400      0      2
##          100.0%   0.0%       
## 
##      freq   perc  lci.95  uci.95'
## UNK   378  94.5%   91.8%   96.3%
## Y      22   5.5%    3.7%    8.2%
## 
## ' 95%-CI (Wilson)
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ SHMRJYN (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 2
## 
##                         
##             UNK        Y
## mean     15.029   16.864
## median   15.000   15.500
## sd        7.740    6.527
## IQR      11.000    6.500
## n           374       22
## np      94.444%   5.556%
## NAs           4        0
## 0s            2        0
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 1.4397, df = 1, p-value = 0.2302
SHALCUSE

The SHALCUSE variable captures how many drinks a person estimates that they have per week. Although there are relatively low counts, it appears that the stroke severity of people who report drinking weekly is lower than it is for people who don’t.

## [1] "Category: Medical History"
## [1] "Description: Frequency of Alcohol Use"
## 
##    1DRINK    2DRINK 3TO5DRINK GTE6DRINK       UNK 
##        24         6        20        27       323
Univariate Statistics
## ------------------------------------------------------------------------------ 
## patients$SHALCUSE (character)
## 
##   length      n    NAs unique levels  dupes
##      400    400      0      5      5      y
##          100.0%   0.0%                     
## 
##        level  freq   perc  cumfreq  cumperc
## 1        UNK   323  80.8%      323    80.8%
## 2  GTE6DRINK    27   6.8%      350    87.5%
## 3     1DRINK    24   6.0%      374    93.5%
## 4  3TO5DRINK    20   5.0%      394    98.5%
## 5     2DRINK     6   1.5%      400   100.0%
Bivariate Statistics
## ------------------------------------------------------------------------------ 
## NIHSSTOT ~ SHALCUSE (patients)
## 
## Summary: 
## n pairs: 400, valid: 396 (99.0%), missings: 4 (1.0%), groups: 5
## 
##                                                              
##            1DRINK     2DRINK  3TO5DRINK  GTE6DRINK        UNK
## mean       15.250     10.500     11.550     14.630     15.476
## median     15.500      8.500     11.000     14.000     15.000
## sd          5.944      8.620      7.409      7.692      7.753
## IQR         7.500      8.750      8.500      7.500     12.000
## n              24          6         20         27        319
## np         6.061%     1.515%     5.051%     6.818%    80.556%
## NAs             0          0          0          0          4
## 0s              0          0          0          0          2
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 7.9449, df = 4, p-value = 0.09362

Linear Regression

Linear regression can be used to create an explanatory model that helps us understand which of the candidate factors selected in the EDA process contribute to higher stroke severity scores in patients. As noted in the EDA section, some of the candidate predictors are highly skewed, have outliers, and contain missing values all of which can pose challenges for linear regression. These issues were addressed via data selection, imputation, and data transformations.

Data selection

The hospital data contains data for patients suffering from two distinct stroke types, ischemic and hemorrhagic. There were only 10 patients who both suffered from hemorrhagic stroke and had a NIHSSTOT score. To focus the study, we chose to exclusively examine ischemic stroke patients. Additionally, 2 of the patients had heights below 95 cm and were dropped from the study. This selection reduced the initial data set from 400 to 386 patients.

Filter Data
# Build LR data set
library(dplyr)

LR.data = patients %>%
  # Focus on ischemic (ISC) patients, drop patients with a missing NIHSSTOT,
  # and exclude the two patients that were dropped from the study
  filter(!is.na(NIHSSTOT) & IEESTRTY == "ISC" & !(SubjectID %in% c("00272-014", "00122-001"))) %>%
  dplyr::select(SubjectID, NIHSSTOT # ID and response variables
                , GLUC, WBC, RBC, HCT, HBG # Lab metrics from blood
                , CLCLTAR, CLCLTWT # Clot metrics
                , AGE, HEIGHT, WEIGHT, BMI # Numeric demographics
                , MHNONE, MHPSIS, MHPSTIA, MHDVT, MHDM, MHHTN, SEX, SHMRJYN, SHALCYN # Binary
                , MHTHROMB, MHATEXCR # Three-level categorical
                , SHALCUSE, MHPSISEL # Five-level categorical
                )

Imputation

Once the data of interest was selected, there were still a fair amount of missing values in the data. For Numeric data, missing values were replaced with the median value. Missing values for categorical variables were encoded as ‘UNK’ for unknown.

Impute Median
# For numeric data (columns 3 to 13, GLUC through BMI), impute the median into missing values
for (j in 3:13) {
  column_values = LR.data[[j]]
  LR.data[[j]][is.na(column_values)] = median(column_values, na.rm = TRUE)
}
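
The categorical half of the imputation step is not shown above. A minimal sketch of it follows, using a made-up `toy` data frame and a hypothetical `cat_cols` vector; it assumes the categorical columns are stored as character vectors, as the univariate summaries in the EDA section suggest.

```r
# Toy stand-in for the categorical columns (values are illustrative only)
toy <- data.frame(
  SubjectID = c("A", "B", "C"),
  MHDM      = c("Y", NA, "Y"),
  SHALCUSE  = c(NA, "1DRINK", NA),
  stringsAsFactors = FALSE
)

# Hypothetical list of categorical columns to encode
cat_cols <- c("MHDM", "SHALCUSE")

# Replace missing categorical values with the explicit level "UNK"
for (col in cat_cols) {
  toy[[col]][is.na(toy[[col]])] <- "UNK"
}
print(toy)
```

Encoding missing values as their own level keeps every patient in the model, at the cost of mixing "no" and "not recorded" whenever a column only ever records "Y".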

Power Transformations

Power transformations were reviewed for each numeric variable that showed signs of skewness or outliers. The following transformations were made to the data to address skewness and rein in outliers.

GLUC

The log-likelihood curve from the Box-Cox analysis (shown below) was used to select an inverse transformation (\(\lambda = -1\)) for the glucose variable.

  • \(GLUC_{new} = \frac{GLUC_{old}^{-1}-1}{-1}\)

The box plots below show the improvements realized by the transformation.
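
The way a Box-Cox log-likelihood curve leads to a transformation can be sketched as follows. This is an illustration on simulated data, not the report's actual code: it assumes `MASS::boxcox()` as the profiling routine, and `box_cox()` and `toy` are hypothetical names.

```r
library(MASS)  # provides boxcox()

# Box-Cox power transformation; lambda = 0 is defined as the natural log
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Simulated positively skewed variable (log-normal, so lambda near 0 is expected)
set.seed(1)
toy <- data.frame(x = rlnorm(200, meanlog = 2, sdlog = 0.8))

# Profile the log-likelihood over a grid of lambda values (intercept-only model)
bc <- boxcox(x ~ 1, data = toy, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# Apply a convenient lambda near the maximum, e.g. -1 (inverse) or 0 (log)
x_inverse <- box_cox(toy$x, -1)
x_log     <- box_cox(toy$x, 0)
```

In practice a round value of \(\lambda\) close to the maximum (such as -1, -1/2, or 0) is usually preferred over the exact maximizer because it is easier to interpret.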

WBC

The log-likelihood curve from the Box-Cox analysis (shown below) was used to select a log transformation for the white blood cell count variable.

  • \(WBC_{new} = \log(WBC_{old})\)

The box plots below show the improvements realized by the transformation.

CLCLTAR

The log-likelihood curve from the Box-Cox analysis (shown below) was used to select a log transformation for the clot area variable.

  • \(CLCLTAR_{new} = \log(CLCLTAR_{old})\)

The box plots below show the improvements realized by the transformation.

CLCLTWT

The log-likelihood curve from the Box-Cox analysis (shown below) was used to select a log transformation for the clot weight variable.

  • \(CLCLTWT_{new} = \log(CLCLTWT_{old})\)

The box plots below show the improvements realized by the transformation.

BMI

The log-likelihood curve from the Box-Cox analysis (shown below) was used to select an inverse square root transformation (\(\lambda = -1/2\)) for the BMI variable.

  • \(BMI_{new} = \frac{BMI_{old}^{-1/2}-1}{-1/2}\)

The box plots below show the improvements realized by the transformation.

Feature Engineering

The SHALCUSE variable indicates how many drinks an individual has per week. This variable was converted to a numeric variable to reflect that as the value increases, so does the number of drinks consumed by the patient. This was the only candidate variable that needed to be refined using feature engineering.

LR.data = LR.data %>%
  mutate(
         SHALC = case_when(
            SHALCUSE == "1DRINK" ~ 1
           ,SHALCUSE == "2DRINK" ~ 2
           ,SHALCUSE == "3TO5DRINK" ~ 4
           ,SHALCUSE == "GTE6DRINK" ~ 7
           ,TRUE ~ 0 ))

Variable Selection

Variable selection was performed by considering all pairwise interactions between the numeric and categorical candidate features and selecting a subset that yielded strong model performance. This process compared the resulting R squared value, BIC, and RSS of models created using forward selection, backward selection, and sequential replacement. The resulting values are shown in the charts below which were used to determine that a linear model with 12 variables would likely explain 15 - 20 percent of the variability within stroke severity without drastically increasing the model BIC. A model with 12 parameters created using a sequential replacement process was selected as the final linear regression model and further refined to increase interpretability.
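
The idea of trading model fit against model size can be illustrated with a small sketch. It uses base R's `step()` with a BIC penalty on simulated data, which is a stand-in and not necessarily the selection routine used to produce the charts for this report; `sim` and its columns are made up.

```r
# Simulated data: only x1 truly drives the response
set.seed(7)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 3 + 2 * sim$x1 + rnorm(n)

# Forward selection starting from the intercept-only model.
# Setting k = log(n) makes step() use the BIC penalty instead of AIC.
null_model <- lm(y ~ 1, data = sim)
forward <- step(
  null_model,
  scope     = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
  direction = "forward",
  k         = log(n),
  trace     = 0
)

# The three criteria compared in this section
selection_summary <- c(
  r_squared = summary(forward)$r.squared,
  BIC       = BIC(forward),
  RSS       = deviance(forward)
)
print(names(coef(forward)))
print(selection_summary)
```

Backward selection follows the same pattern with `direction = "backward"` and the full model as the starting point.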

Model Refinement

First Iteration

The parameters and diagnostic plots for the first iteration of the final model are shown below. The diagnostic plots indicate that the linear model fits the data reasonably well and that we can proceed with refining the model. We observe that the CLCLTAR.tran:MHHTN terms are not statistically significant. The first model refinement is to drop these from the model.

## 
## Call:
## lm(formula = NIHSSTOT ~ AGE + SEX + GLUC.tran:SEX + WBC.tran:MHPSIS + 
##     WBC.tran:SHMRJYN + HCT:SEX + HBG:SEX + CLCLTAR.tran:MHHTN + 
##     CLCLTAR.tran:SEX + SHALC:MHPSIS + SHALC:MHDM + SHALC:MHHTN, 
##     data = LR.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5453  -4.9156  -0.2949   4.4144  24.6650 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)   
## (Intercept)            179.72033  250.54557   0.717  0.47364   
## AGE                      0.07285    0.02674   2.725  0.00674 **
## SEX2                  -925.04733  340.51178  -2.717  0.00691 **
## SEX1:GLUC.tran        -180.35530  252.32706  -0.715  0.47521   
## SEX2:GLUC.tran         746.54666  245.05111   3.046  0.00248 **
## WBC.tran:MHPSISUNK       0.84498    0.92400   0.914  0.36107   
## WBC.tran:MHPSISY         2.48533    1.02215   2.431  0.01552 * 
## WBC.tran:SHMRJYNY        1.92957    0.75260   2.564  0.01075 * 
## SEX1:HCT                 0.12184    0.22112   0.551  0.58195   
## SEX2:HCT                 0.88788    0.33881   2.621  0.00915 **
## SEX1:HBG                 0.07540    0.56638   0.133  0.89417   
## SEX2:HBG                -2.48925    0.94942  -2.622  0.00911 **
## CLCLTAR.tran:MHHTNUNK   -0.10655    0.49936  -0.213  0.83116   
## CLCLTAR.tran:MHHTNY      0.34837    0.47584   0.732  0.46457   
## SEX2:CLCLTAR.tran        2.11258    0.64821   3.259  0.00122 **
## MHPSISUNK:SHALC          0.17884    0.33872   0.528  0.59783   
## MHPSISY:SHALC           -2.25467    1.01463  -2.222  0.02689 * 
## SHALC:MHDMY              1.47885    0.52240   2.831  0.00490 **
## MHHTNY:SHALC            -1.11189    0.43269  -2.570  0.01057 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.975 on 365 degrees of freedom
## Multiple R-squared:  0.2083, Adjusted R-squared:  0.1692 
## F-statistic: 5.334 on 18 and 365 DF,  p-value: 4.856e-11

Second Iteration

Dropping the CLCLTAR.tran:MHHTN terms from the model does not drastically impact the R squared value. Reviewing the model, we find that the statistically significant terms involving sex only indicate differences for females. The model can be simplified by coding SEX as an indicator variable for female. The same can be done for the MHPSIS variable to indicate if the patient had a previous ischemic stroke. This was done to create the final model shown in the Final Model section.

## 
## Call:
## lm(formula = NIHSSTOT ~ AGE + SEX + GLUC.tran:SEX + WBC.tran:MHPSIS + 
##     WBC.tran:SHMRJYN + HCT:SEX + HBG:SEX + CLCLTAR.tran:SEX + 
##     SHALC:MHPSIS + SHALC:MHDM + SHALC:MHHTN, data = LR.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0587  -4.7880  -0.2755   4.6702  25.2325 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          91.26026  248.61862   0.367  0.71378    
## AGE                   0.08286    0.02649   3.128  0.00190 ** 
## SEX2               -899.37944  342.10571  -2.629  0.00893 ** 
## SEX1:GLUC.tran      -91.47090  250.40151  -0.365  0.71510    
## SEX2:GLUC.tran      809.61116  244.65999   3.309  0.00103 ** 
## WBC.tran:MHPSISUNK    0.74869    0.92784   0.807  0.42024    
## WBC.tran:MHPSISY      2.49040    1.02753   2.424  0.01585 *  
## WBC.tran:SHMRJYNY     1.87816    0.75620   2.484  0.01345 *  
## SEX1:HCT              0.15149    0.22188   0.683  0.49518    
## SEX2:HCT              0.82477    0.33938   2.430  0.01557 *  
## SEX1:HBG             -0.03871    0.56698  -0.068  0.94561    
## SEX2:HBG             -2.33112    0.95170  -2.449  0.01478 *  
## SEX1:CLCLTAR.tran     0.24355    0.47596   0.512  0.60917    
## SEX2:CLCLTAR.tran     2.37560    0.44121   5.384  1.3e-07 ***
## MHPSISUNK:SHALC      -0.05462    0.32346  -0.169  0.86600    
## MHPSISY:SHALC        -2.53661    1.01185  -2.507  0.01261 *  
## SHALC:MHDMY           1.45564    0.52504   2.772  0.00585 ** 
## SHALC:MHHTNY         -0.76512    0.40525  -1.888  0.05981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.012 on 366 degrees of freedom
## Multiple R-squared:  0.1977, Adjusted R-squared:  0.1604 
## F-statistic: 5.305 on 17 and 366 DF,  p-value: 1.535e-10
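
The indicator recoding described for the second iteration can be sketched as follows. This is a minimal illustration on a made-up `toy` data frame; it assumes that `SEX == 2` denotes female (consistent with the significant `SEX2` terms above) and that unknown previous-stroke status is grouped with "no previous stroke".

```r
# Toy stand-in for LR.data (values are illustrative only)
toy <- data.frame(
  SEX    = c(1, 2, 2, 1),
  MHPSIS = c("N", "Y", "UNK", "Y")
)

# 1 if the patient is female, 0 otherwise (assumes SEX == 2 codes female)
toy$FEMALE <- as.integer(toy$SEX == 2)

# 1 if a previous ischemic stroke was recorded, 0 otherwise (UNK grouped with 0)
toy$PREV_ISC <- as.integer(toy$MHPSIS == "Y")

print(toy)
```

With 0/1 indicators, an interaction such as `GLUC.tran:FEMALE` contributes to the prediction only for female patients, which removes the non-significant male-specific terms from the model.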

Final Model

The final model explains 16.5 percent of the variability in the stroke severity score among patients (adjusted R-squared of 0.165). The amount of variance explained compared to the total variance is visually represented using histograms in the Final Model Histograms section. The model has revealed some interesting details regarding the explanatory variables. The final model parameter estimates along with their p-values and confidence intervals are shown below. The diagnostic plots are also provided and indicate that the final model fits the data reasonably well. Model interpretations are provided in the Model Interpretation section.

## 
## Call:
## lm(formula = NIHSSTOT ~ AGE + FEMALE + GLUC.tran:FEMALE + HCT:FEMALE + 
##     HBG:FEMALE + CLCLTAR.tran:FEMALE + WBC.tran:PREV_ISC + WBC.tran:SHMRJYN + 
##     SHALC:PREV_ISC + SHALC:MHDM + SHALC:MHHTN, data = LR.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0359  -5.0470  -0.1371   4.5406  25.4849 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            9.32666    1.84037   5.068 6.35e-07 ***
## AGE                    0.07457    0.02574   2.897 0.003987 ** 
## FEMALE              -865.40136  234.74403  -3.687 0.000261 ***
## FEMALE:GLUC.tran     860.03318  237.18279   3.626 0.000328 ***
## FEMALE:HCT             0.82557    0.33831   2.440 0.015141 *  
## FEMALE:HBG            -2.31997    0.94836  -2.446 0.014896 *  
## FEMALE:CLCLTAR.tran    2.36233    0.43941   5.376 1.35e-07 ***
## WBC.tran:PREV_ISC      1.79133    0.49050   3.652 0.000297 ***
## WBC.tran:SHMRJYNY      1.87182    0.74565   2.510 0.012486 *  
## PREV_ISC:SHALC        -2.45977    0.96284  -2.555 0.011025 *  
## SHALC:MHDMY            1.41145    0.50649   2.787 0.005597 ** 
## SHALC:MHHTNY          -0.76885    0.26910  -2.857 0.004516 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.993 on 372 degrees of freedom
## Multiple R-squared:  0.189,  Adjusted R-squared:  0.165 
## F-statistic: 7.881 on 11 and 372 DF,  p-value: 2.457e-12

Final Model Histograms

The histogram below shows the distribution of the stroke severity score among patients. The wide spread illustrates the variance in the score.

The histogram below shows the distribution of the stroke severity scores predicted by the final model. The narrow spread illustrates the variance in predicted values. When we compare it against the first histogram, we observe that the model predictions are closer to a normal distribution and have a lower amount of variation. This depicts the amount of variation in stroke severity that the final model accounts for.

Model Interpretation

Intercept

The intercept in this model can be interpreted as the expected stroke severity score for male patients that:

  • are not diabetic
  • don’t have a prior ischemic stroke
  • don’t have hypertension
  • have not used marijuana

Age

Age has a positive correlation with stroke severity score but it is more statistically significant than practically significant. The coefficient indicates that for each 13 year increase in age, the stroke severity is expected to increase by 1 point.
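
The "13 year" figure follows directly from the AGE coefficient in the final model. A short worked calculation, with hypothetical variable names:

```r
# AGE coefficient copied from the final model summary
age_coef <- 0.07457

# Years of age associated with a 1-point increase in expected NIHSSTOT
years_per_point <- 1 / age_coef
print(round(years_per_point, 1))

# Expected change in NIHSSTOT over a 10-year age difference
change_over_decade <- 10 * age_coef
print(change_over_decade)
```

On a scale that runs from 0 to 42, a change of less than one point per decade is small, which is why the effect is described as more statistically than practically significant.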

Sex Interactions

If the patient is a female, her expected stroke severity score depends on her Glucose levels, the size of the clot, her Hematocrit (percent of red blood cells), and her Hemoglobin levels. The charts below show how a female patient’s stroke severity score is expected to be impacted by these levels. The gray and black lines indicate the 1st quartile, median, and 3rd quartile for the glucose, clot area, hematocrit, and hemoglobin among females and the range on the X axis covers the max and min values measured in the study.
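
To make the female-specific terms concrete, the sketch below adds up their contribution using the coefficients from the final model and the transformations described in the Power Transformations section. The function name `female_adjustment()` and the example input values are hypothetical; the inputs are assumed to be in the same units as the study data.

```r
# Coefficients copied from the final model summary
b_female <- -865.40136
b_gluc   <-  860.03318
b_hct    <-    0.82557
b_hbg    <-   -2.31997
b_clot   <-    2.36233

# Expected difference in NIHSSTOT for a female patient relative to a male
# patient with otherwise identical model inputs
female_adjustment <- function(gluc, hct, hbg, clot_area) {
  gluc_tran <- 1 - 1 / gluc      # equals (gluc^-1 - 1) / (-1)
  clot_tran <- log(clot_area)    # log transformation used for CLCLTAR
  b_female + b_gluc * gluc_tran + b_hct * hct + b_hbg * hbg + b_clot * clot_tran
}

# Illustrative inputs only (not taken from the study data)
low_glucose  <- female_adjustment(gluc = 100, hct = 40, hbg = 13, clot_area = 20)
high_glucose <- female_adjustment(gluc = 200, hct = 40, hbg = 13, clot_area = 20)
print(c(low_glucose = low_glucose, high_glucose = high_glucose))
```

Because the transformed glucose value stays close to 1 for typical glucose readings, the large FEMALE and FEMALE:GLUC.tran coefficients nearly cancel, so they should be read together and not separately.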

White Blood Cell Counts

White blood cell count is positively correlated with stroke severity for patients that have had a prior ischemic stroke and/or have a history of using marijuana. The chart below shows how the stroke severity is impacted by white blood cell count. The green line indicates the trend for patients who have had a previous ischemic stroke but have not used marijuana previously. The blue line shows the trend for patients who have used marijuana previously but have not had a prior ischemic stroke. The red line shows the trend for patients who have both had a previous ischemic stroke and have used marijuana in the past. The gray and black lines show the 1st quartile, the median, and the 3rd quartile.

Alcohol Consumption

Alcohol consumption was another factor that had an influence on a patient’s stroke severity, but its influence depended on if the patient had a previous ischemic stroke, if they were diabetic, and if they had hypertension. The trends for each of these and the possible combinations are shown in the graph below. Alcohol consumption generally is associated with a decrease in stroke severity for patients without diabetes. This finding should be interpreted with caution since the majority of patients didn’t report how much alcohol they consumed. For those patients, their alcohol consumption was coded as a 0 and, as the chart indicates, no adjustment was made to their score.
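
The combined alcohol effect for each patient profile can be tabulated from the three SHALC interaction coefficients in the final model. The sketch below is illustrative; the `profiles` table and its column names are hypothetical.

```r
# SHALC interaction coefficients copied from the final model summary
b_prev_isc <- -2.45977  # PREV_ISC:SHALC
b_dm       <-  1.41145  # SHALC:MHDMY
b_htn      <- -0.76885  # SHALC:MHHTNY

# All combinations of previous ischemic stroke, diabetes, and hypertension
profiles <- expand.grid(PREV_ISC = 0:1, MHDM = 0:1, MHHTN = 0:1)

# Expected change in NIHSSTOT per unit of the SHALC score for each profile
profiles$slope_per_SHALC <- b_prev_isc * profiles$PREV_ISC +
  b_dm * profiles$MHDM +
  b_htn * profiles$MHHTN

# Expected change at SHALC = 4 (the value assigned to the 3TO5DRINK level)
profiles$change_at_4 <- 4 * profiles$slope_per_SHALC

print(profiles)
```

Every profile without diabetes has a slope of zero or below, which matches the observation that alcohol consumption is generally associated with lower severity for non-diabetic patients.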

Proteomics Data

The proteomics data set was evaluated using a differential expression analysis to search for biomarkers. This was done by regressing the NIHSSTOT response variable onto each protein using a linear model and reviewing the resulting p-values to determine if the protein was statistically significant. Two inherent challenges in this process were handling skewness in the protein data and accounting for false discovery among the large number of models created. These challenges were addressed using the following approaches:

  1. Skewness: The protein data was highly positively skewed with large concentrations of 0s. This was addressed by treating 0s as missing values for the proteins and then log transforming the non-zero values prior to creating a linear model.
  2. False Discovery: After creating a model for each protein, their corresponding p-values were replaced with q-values using the approach outlined in the Storey & Tibshirani (2003) paper entitled “Statistical significance for genomewide studies.”
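
The per-protein modeling step can be sketched as follows. This is a minimal illustration on simulated data: `proteins`, `severity`, and `protein_p_value()` are hypothetical names, and the requirement of at least three non-zero observations is an assumption added so that each toy model can be fit.

```r
# Simulated stand-ins: one severity score per patient, three proteins with many zeros
set.seed(3)
n <- 70
severity <- sample(1:30, n, replace = TRUE)
proteins <- data.frame(
  P_A = rlnorm(n) * rbinom(n, 1, 0.6),
  P_B = rlnorm(n) * rbinom(n, 1, 0.8),
  P_C = rlnorm(n) * rbinom(n, 1, 0.5)
)

# Fit severity ~ log(protein) after treating zeros as missing,
# and return the p-value of the protein term
protein_p_value <- function(intensity, response) {
  intensity[intensity == 0] <- NA
  if (sum(!is.na(intensity)) < 3) return(NA_real_)
  fit <- lm(response ~ log(intensity))  # rows with NA are dropped by lm()
  summary(fit)$coefficients["log(intensity)", "Pr(>|t|)"]
}

p_values <- sapply(proteins, protein_p_value, response = severity)
print(p_values)
```

Treating zeros as missing reduces the number of observations behind each model, so the sample size should be checked for any protein that appears significant.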

As is sometimes the case, none of the q-values associated with the proteins were statistically significant. The search was expanded by adding the candidate predictors from the hospital data as covariates in the linear protein models. Additionally, a new variable called STROKE_BELT was introduced to see if any biomarkers existed when the model accounted for regions in the US that have higher concentrations of strokes. This wider sweep of the data identified 1 biomarker which was present in the model that incorporated the individual’s sex.

The p-value plots below show the histograms of the p-values for the base models, the models that incorporate the patient’s sex, and the models that incorporate the stroke belt indicator. If there are no significant proteins, then we expect the p-values to be uniformly distributed. The chart below each histogram shows the proportion of truly null values as a function of the tuning parameter \(\lambda\). The cubic spline fit to the \(\pi(\lambda)\) vs \(\lambda\) is used to estimate the proportion of null values at \(\lambda =1\) for determining the q-values.
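
The q-value calculation described above can be sketched in base R. This is a simplified illustration of the Storey & Tibshirani (2003) procedure and not the code used for this report: `storey_q()` is a hypothetical helper, the \(\lambda\) grid and the spline degrees of freedom are assumptions, and the p-values are simulated.

```r
# Simplified Storey-Tibshirani q-values (illustrative sketch)
storey_q <- function(p, lambdas = seq(0.05, 0.95, by = 0.05)) {
  m <- length(p)

  # Estimated proportion of true nulls for each value of the tuning parameter
  pi0_lambda <- sapply(lambdas, function(l) mean(p > l) / (1 - l))

  # Smooth pi0(lambda) with a cubic spline and evaluate it at lambda = 1
  fit <- smooth.spline(lambdas, pi0_lambda, df = 3)
  pi0 <- predict(fit, x = 1)$y
  pi0 <- min(max(pi0, 0), 1)  # keep the estimate inside [0, 1]

  # q-value for the i-th smallest p-value: min over j >= i of pi0 * m * p_(j) / j
  o <- order(p)
  q_sorted <- pi0 * m * p[o] / seq_len(m)
  q_sorted <- pmin(rev(cummin(rev(q_sorted))), 1)

  q <- numeric(m)
  q[o] <- q_sorted
  list(pi0 = pi0, qvalues = q)
}

# Simulated p-values: mostly null (uniform) with a small enriched group near 0
set.seed(10)
p <- c(runif(900), rbeta(100, 0.5, 10))
res <- storey_q(p)
print(res$pi0)
print(min(res$qvalues))
```

When the p-values are close to uniform, the estimated null proportion is close to 1 and the q-values behave much like a Benjamini-Hochberg adjustment.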

P Value Plots

Base Model

Base Model + FEMALE

##   Protein     beta       se      p_value    Q_VALUE          Protein_IDs
## 1   P_351 24.49994 2.293611 3.971638e-05 0.03447381 P06493;Q07785;P61075
##   Majority_protein_IDs
## 1               P06493

Base Model + STROKE_BELT

Significant Proteins

The only proteins with statistically significant q-values were those associated with protein batch 351 when the explanatory model includes sex as a covariate:

  • \(NIHSSTOT = P_{351}+FEMALE\)

The plot below shows what the resulting model looks like. When stroke severity is regressed onto protein \(P_{351}\), there is a statistically significant relationship if sex is included as a covariate. Stroke severity increases at the same rate for females and males, but there is an almost 10 point vertical shift between the lines, indicating that females have higher stroke severity if everything else is held constant. Both sexes experience higher stroke severity as the quantity of protein \(P_{351}\) increases. Unfortunately, the data set is rather small and there are only 9 observations used in the model, meaning the addition of a single point that doesn’t fit the displayed trend could completely change the results. The small sample is a consequence of treating 0s as missing values for the protein data.

Model Details
## 
## Call:
## lm(formula = response ~ predictor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2039 -0.6026  0.2002  1.0103  1.7581 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -457.062     43.827 -10.429 4.56e-05 ***
## predictorFEMALE    9.868      1.262   7.821 0.000231 ***
## predictorP_351    24.500      2.294  10.682 3.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.68 on 6 degrees of freedom
##   (65 observations deleted due to missingness)
## Multiple R-squared:  0.9544, Adjusted R-squared:  0.9392 
## F-statistic: 62.74 on 2 and 6 DF,  p-value: 9.504e-05

Protein 351 represents 3 different proteins: P06493, Q07785, and P61075. The majority protein is P06493. Details of each are provided below.

Previously linked to Stroke severity

The proteins identified by our model are all Cyclin-Dependent Kinases (CDKs) which have previously been linked to stroke cases. While our model has failed to detect previously unknown biomarkers with regards to stroke severity, it has produced some evidence to support previous findings. The interested reader can learn more from the following links:

SNPs

The final data extracted as a part of this study were Single Nucleotide Polymorphisms (SNPs), which were collected for each of the 400 participants. SNPs represent positions in the human genetic code (DNA) where substantial variability occurs and are particularly useful for identifying disease-causing genes. These data were evaluated using a similar approach to that used for evaluating the proteomics data but had their own unique set of challenges:

  1. Data: The SNP data was spread across 6 different files which totaled nearly 10 GB of data. Additionally, the data was not stored in a model-friendly format. Each file had to be wrangled into a format that could be passed into a model and the results needed to be combined into a single complete source of data. It took approximately 30 minutes of computer run time to extract and transform the details required. The result was a 0.5 GB file with 1 row for each patient (400) and 1 column for each SNP (654K).
  2. False Discovery: Similar to the proteomics data, 1 model had to be created for each SNP which resulted in 654K p-values. False discovery among these values was controlled by replacing p-values with q-values using the approach outlined in the Storey & Tibshirani (2003) paper entitled “Statistical significance for genomewide studies.”
  3. Computational Restrictions: Each model form considered had to be fit 654K times (once for each SNP). Depending on the model complexity, fitting all 654K models for a given form took 1 - 2 hours. As a result, models of interest had to be carefully selected based on the previously gathered details.

Linear Regression

The first set of models evaluated on the SNP data were linear regression models where the stroke severity (NIHSSTOT) was used as the response and the SNP was used as the predictor along with the hospital data as covariates. P-values for each model were extracted using ANOVA tests to determine statistical relevance of a given SNP. Unfortunately, the histograms of the p-values of the resulting models are uniformly distributed, which is a key indication that none of the SNPs are truly statistically significant. This was confirmed by calculating the q-values, which resulted in no statistically significant SNPs.
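
One way to extract an ANOVA p-value for a single SNP is to compare the model with and without the SNP term. The sketch below illustrates this on simulated data; it is an assumption about the general approach and not the code used for this report. `snp_anova_p()`, `toy`, and `genotypes` are hypothetical, and SNPs are assumed to be coded as 0/1/2 allele counts.

```r
# ANOVA p-value for adding one SNP to a model with optional covariates
snp_anova_p <- function(snp, data, covariates = character(0)) {
  data$SNP <- snp
  data <- data[!is.na(data$SNP), ]  # both models must use the same rows
  reduced_terms <- if (length(covariates) == 0) "1" else covariates
  reduced <- lm(reformulate(reduced_terms, response = "NIHSSTOT"), data = data)
  full <- lm(reformulate(c(reduced_terms, "SNP"), response = "NIHSSTOT"), data = data)
  anova(reduced, full)[2, "Pr(>F)"]
}

# Simulated patients and genotypes (illustrative only)
set.seed(5)
n <- 120
toy <- data.frame(
  NIHSSTOT = sample(0:42, n, replace = TRUE),
  FEMALE   = rbinom(n, 1, 0.5)
)
genotypes <- as.data.frame(matrix(sample(0:2, n * 5, replace = TRUE), nrow = n))
names(genotypes) <- paste0("SNP_", 1:5)

# One model per SNP, each adjusted for FEMALE
p_values <- sapply(genotypes, snp_anova_p, data = toy, covariates = "FEMALE")
print(p_values)
```

Since the simulated genotypes are unrelated to the response, the p-values here should look roughly uniform, which mirrors the histograms described above.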

NIHSSTOT

Base Model

LM1 = lm( formula = NIHSSTOT ~ SNP , data = SNP_MODEL_DATA)

SNP_MODEL_RESULTS_LM1 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM1.csv")
hist(SNP_MODEL_RESULTS_LM1$p_value)

Base + FEMALE

LM2 = lm( formula = NIHSSTOT ~ SNP + FEMALE , data = SNP_MODEL_DATA)

SNP_MODEL_RESULTS_LM2 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM2.csv")
hist(SNP_MODEL_RESULTS_LM2$SNP_ANOVA_pvalue)

Base + PREV_ISC

LM3 = lm( formula = NIHSSTOT ~ SNP + PREV_ISC , data = SNP_MODEL_DATA)

SNP_MODEL_RESULTS_LM3 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM3.csv")
hist(SNP_MODEL_RESULTS_LM3$SNP_ANOVA_pvalue)

Base + SHMRJYN

LM4 = lm( formula = NIHSSTOT ~ SNP + SHMRJYN , data = SNP_MODEL_DATA)

SNP_MODEL_RESULTS_LM4 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM4.csv")
hist(SNP_MODEL_RESULTS_LM4$SNP_ANOVA_pvalue)

Base + MHDMY

LM5 = lm( formula = NIHSSTOT ~ SNP + MHDMY , data = SNP_MODEL_DATA)

SNP_MODEL_RESULTS_LM5 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LM5.csv")
hist(SNP_MODEL_RESULTS_LM5$SNP_ANOVA_pvalue)

Logistic Regression

To widen the SNP analysis, a new response was selected. Some regions of the US are prone to higher stroke rates and are said to lie within the “stroke belt.” Each patient’s hospital location was known, and the binary variable STROKE_BELT was created to indicate whether the patient resided in the stroke belt. Logistic regression was used to model the STROKE_BELT variable against the SNP data with variables from the hospital data as covariates. P-values for each logistic regression model were extracted using ANOVA tests to determine the statistical relevance of a given SNP. The histograms of the resulting p-values do appear to have more values concentrated near 0 (a good sign), but when the corresponding q-values were calculated, there were no statistically significant results.
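
As a rough illustration of how a per-SNP p-value can be pulled from a logistic regression, the sketch below fits one model on simulated data and reads the SNP term from the analysis-of-deviance table. The data frame `sim`, its simulated values, and the choice of a chi-square (likelihood-ratio) test are assumptions; the text above does not state which ANOVA test was requested.

```r
# Simulated stand-in data (hypothetical values; column names follow the text).
set.seed(7)
n_patients <- 200
sim <- data.frame(
  STROKE_BELT = rbinom(n_patients, size = 1, prob = 0.4),
  SNP         = rbinom(n_patients, size = 2, prob = 0.3),
  FEMALE      = rbinom(n_patients, size = 1, prob = 0.5)
)

# Logistic regression of the binary response on one SNP plus a covariate.
fit <- glm(STROKE_BELT ~ SNP + FEMALE, data = sim,
           family = binomial(link = "logit"))

# test = "Chisq" requests a sequential likelihood-ratio (deviance) test
# for each term, in the order the terms appear in the formula.
dev_table <- anova(fit, test = "Chisq")
snp_p_value <- dev_table["SNP", "Pr(>Chi)"]
snp_p_value
```

Repeating this fit for every SNP column and collecting `snp_p_value` each time would yield the vector of p-values that is then converted to q-values.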

STROKE BELT

Base Model

LGR1 = glm( formula = STROKE_BELT ~ SNP , data = SNP_MODEL_DATA, family = binomial(link = "logit"))

SNP_MODEL_RESULTS_LGR1 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LGR1.csv")
hist(SNP_MODEL_RESULTS_LGR1$p_value)

min(SNP_MODEL_RESULTS_LGR1$q_value, na.rm = TRUE)
## [1] 0.336149

Base Model + FEMALE

LGR2 = glm( formula = STROKE_BELT ~ SNP + FEMALE , data = SNP_MODEL_DATA, family = binomial(link = "logit"))

SNP_MODEL_RESULTS_LGR2 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LGR2.csv")
hist(SNP_MODEL_RESULTS_LGR2$p_value)

min(SNP_MODEL_RESULTS_LGR2$q_value, na.rm = TRUE)
## [1] 0.3584762

Base Model + PREV_ISC

LGR3 = glm( formula = STROKE_BELT ~ SNP + PREV_ISC , data = SNP_MODEL_DATA, family = binomial(link = "logit"))

SNP_MODEL_RESULTS_LGR3 = read_csv("Project_Output/modelOutput/SNP_MODEL_RESULTS_LGR2.csv")
hist(SNP_MODEL_RESULTS_LGR3$p_value)

min(SNP_MODEL_RESULTS_LGR3$q_value, na.rm = TRUE)
## [1] 0.3584762

Summary

This analysis tackled three different sources of data with the broad objective of finding any statistically significant details related to stroke severity. It focused on creating explanatory models and used an evidence-focused approach. Evaluation of the hospital data resulted in a model that helped explain the difference in stroke severity variance between males and females. The model was able to explain approximately 16.5% of the variation in the entire data set. Analysis of the proteomics data again found statistically significant differences in the stroke severity experienced by male and female patients and found evidence supporting previous medical studies that connected Cyclin-Dependent Kinases (CDKs) to ischemic strokes. Finally, this study evaluated SNP data in the hope of identifying genes that may cause strokes or increase stroke severity and found no evidence of either among the patients in the study.